For this exercise, we will use the same subset of the GESIS Panel Special Survey on the Coronavirus SARS-CoV-2 Outbreak in Germany data as in the presentation. Just run the following code to go through the wrangling pipeline. Remember that the .csv file should be stored in the data folder in the directory with the course materials.

library(tidyverse)
library(naniar)

gesis_panel_corona <- read_csv2("../../../data/ZA5667_v1-1-0.csv")

missings <- c(-111, -99, -77, -33, -22)

corona_survey <- gesis_panel_corona %>% 
  select(id,
         sex:education_cat,
         choice_of_party,
         left_right = political_orientation,
         risk_self = hzcy001a,
         risk_surround = hzcy002a,
         avoid_places = hzcy006a,
         keep_distance = hzcy007a,
         wash_hands = hzcy011a,
         stockup_supplies = hzcy013a,
         reduce_contacts = hzcy014a,
         wear_mask = hzcy015a,
         trust_rki = hzcy047a,
         trust_government = hzcy048a,
         trust_chancellor = hzcy049a,
         trust_who = hzcy051a,
         trust_scientists = hzcy052a,
         info_national_public_tv = hzcy084a,
         info_national_newspaper = hzcy086a,
         info_local_newspaper = hzcy089a,
         info_facebook = hzcy090a,
         info_other_social_media = hzcy091a) %>% 
  replace_with_na_all(condition = ~.x %in% missings) %>% 
  replace_with_na(replace = list(choice_of_party = c(97, 98),
                                 risk_self = c(97),
                                 risk_surround = c(97),
                                 trust_rki = c(98),
                                 trust_government = c(98),
                                 trust_chancellor = c(98),
                                 trust_who = c(98),
                                 trust_scientists = c(98))) %>%
  mutate(sex = recode_factor(sex,
                             `1` = "Male",
                             `2` = "Female"),
         education_cat = recode_factor(education_cat,
                                       `1` = "Low",
                                       `2` = "Medium",
                                       `3` = "High",
                                       .ordered = TRUE),
         age_cat = recode_factor(age_cat,
                                 `1` = "<= 25 years",
                                 `2` = "26 to 30 years",
                                 `3` = "31 to 35 years",
                                 `4` = "36 to 40 years",
                                 `5` = "41 to 45 years",
                                 `6` = "46 to 50 years",
                                 `7` = "51 to 60 years",
                                 `8` = "61 to 65 years",
                                 `9` = "66 to 70 years",
                                 `10` = ">= 71 years",
                                 .ordered = TRUE),
         choice_of_party = recode_factor(choice_of_party,
                                         `1` = "CDU/CSU",
                                         `2` = "SPD",
                                         `3` = "FDP",
                                         `4` = "Linke",
                                         `5` = "Gruene",
                                         `6` = "AfD",
                                         `7` = "Other")
  ) %>%
  mutate(sum_measures = avoid_places + 
           keep_distance + 
           wash_hands + 
           stockup_supplies + 
           reduce_contacts + 
           wear_mask,
         sum_sources = info_national_public_tv + 
           info_national_newspaper + 
           info_local_newspaper + 
           info_facebook + 
           info_other_social_media) %>% 
  rowwise() %>% 
  mutate(mean_trust = mean(c(trust_rki, 
                             trust_government, 
                             trust_chancellor, 
                             trust_who, 
                             trust_scientists),
                           na.rm = TRUE)) %>% 
  ungroup()

As we will use the same dataset again in the next exercise in this session, it makes sense to save it. To preserve the information about the variable types, it is best to save it as a .rds file. You can do this with the following command:

saveRDS(corona_survey, "../data/gp_corona_subset.rds")
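
To illustrate why the .rds format is the safer choice here, the following sketch round-trips a small made-up data frame (not the survey data) through a temporary .rds file and a temporary .csv file. The toy column values and the temporary file names are assumptions for illustration only.

```r
# Toy data with an ordered factor, mimicking education_cat (made-up values)
toy <- data.frame(
  id = 1:3,
  education_cat = factor(c("Low", "High", "Medium"),
                         levels = c("Low", "Medium", "High"),
                         ordered = TRUE)
)

# Temporary files, so nothing in the course folders is touched
rds_file <- tempfile(fileext = ".rds")
csv_file <- tempfile(fileext = ".csv")

saveRDS(toy, rds_file)
write.csv(toy, csv_file, row.names = FALSE)

from_rds <- readRDS(rds_file)
from_csv <- read.csv(csv_file, stringsAsFactors = FALSE)

class(from_rds$education_cat)  # the ordered factor is preserved
class(from_csv$education_cat)  # the column comes back as plain character
```

In the next exercise, the saved dataset can be loaded again with readRDS("../data/gp_corona_subset.rds"), assuming the same working directory.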

In case you have not done so, please also install the summarytools and the GGally package. The following code chunk will check if you have these packages installed and install them, if that is not the case.

if (!require(summarytools)) install.packages("summarytools")
if (!require(GGally)) install.packages("GGally")

1

Print a simple table with the frequencies of the variable age_cat. Also include the counts for missing values.
You can use the table() function from base R for this.
table(corona_survey$age_cat, useNA = "always")
## 
##    <= 25 years 26 to 30 years 31 to 35 years 36 to 40 years 41 to 45 years 
##            107            267            276            328            317 
## 46 to 50 years 51 to 60 years 61 to 65 years 66 to 70 years    >= 71 years 
##            367            978            386            357            382 
##           <NA> 
##              0
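
Because the <NA> count in the output above is 0, the effect of the useNA argument is not visible there. A minimal sketch with a made-up factor (an assumption for illustration, not survey data) shows the difference:

```r
# Made-up factor with one missing value and one unused level
x <- factor(c("Low", "High", NA, "Low"),
            levels = c("Low", "Medium", "High"))

counts_default <- table(x)                    # NA is silently dropped
counts_with_na <- table(x, useNA = "always")  # NA gets its own cell

counts_default
counts_with_na
```

With useNA = "always", the <NA> cell is printed even when its count is zero, which is why it appears in the age_cat table.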

2

Use a function from the summarytools package to get summary statistics for the following variables in your dataset: left_right, sum_measures, mean_trust.
You need to combine a wrangling function from the dplyr package with descr() from summarytools.
library(summarytools)

corona_survey %>% 
  select(left_right,
         sum_measures,
         mean_trust) %>%
  descr()
## Descriptive Statistics  
## corona_survey  
## N: 3765  
## 
##                     left_right   mean_trust   sum_measures
## ----------------- ------------ ------------ --------------
##              Mean         4.66         3.98           3.77
##           Std.Dev         1.86         0.75           1.16
##               Min         0.00         1.00           0.00
##                Q1         3.00         3.60           3.00
##            Median         5.00         4.00           4.00
##                Q3         6.00         4.60           5.00
##               Max        10.00         5.00           6.00
##               MAD         1.48         0.59           1.48
##               IQR         3.00         1.00           2.00
##                CV         0.40         0.19           0.31
##          Skewness        -0.10        -0.94          -1.14
##       SE.Skewness         0.04         0.04           0.04
##          Kurtosis        -0.16         1.01           1.43
##           N.Valid      3678.00      3157.00        3186.00
##         Pct.Valid        97.69        83.85          84.62

3

Use another function from summarytools to display the counts and frequencies for the categories in the age_cat variable.
The function you need is freq().
freq(corona_survey$age_cat)
## Frequencies  
## corona_survey$age_cat  
## Type: Ordered Factor  
## 
##                        Freq   % Valid   % Valid Cum.   % Total   % Total Cum.
## -------------------- ------ --------- -------------- --------- --------------
##          <= 25 years    107      2.84           2.84      2.84           2.84
##       26 to 30 years    267      7.09           9.93      7.09           9.93
##       31 to 35 years    276      7.33          17.26      7.33          17.26
##       36 to 40 years    328      8.71          25.98      8.71          25.98
##       41 to 45 years    317      8.42          34.40      8.42          34.40
##       46 to 50 years    367      9.75          44.14      9.75          44.14
##       51 to 60 years    978     25.98          70.12     25.98          70.12
##       61 to 65 years    386     10.25          80.37     10.25          80.37
##       66 to 70 years    357      9.48          89.85      9.48          89.85
##          >= 71 years    382     10.15         100.00     10.15         100.00
##                 <NA>      0                               0.00         100.00
##                Total   3765    100.00         100.00    100.00         100.00

4

Use yet another summarytools function to create a crosstable for the variables sex and education_cat.
The function for this is ctable().
ctable(corona_survey$sex, corona_survey$education_cat)
## Cross-Tabulation, Row Proportions  
## sex * education_cat  
## Data Frame: corona_survey  
## 
## -------- --------------- ------------- -------------- -------------- ---------------
##            education_cat           Low         Medium           High           Total
##      sex                                                                            
##     Male                   255 (13.2%)    526 (27.2%)   1152 (59.6%)   1933 (100.0%)
##   Female                   168 ( 9.2%)    628 (34.3%)   1036 (56.6%)   1832 (100.0%)
##    Total                   423 (11.2%)   1154 (30.7%)   2188 (58.1%)   3765 (100.0%)
## -------- --------------- ------------- -------------- -------------- ---------------
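
The row proportions in the table can be checked by hand. The following sketch re-enters the cell counts printed above into a matrix and computes row percentages with base R. It is only a cross-check under the assumption that the printed counts are correct, not a description of how ctable() works internally.

```r
# Cell counts copied from the cross-tabulation output above
counts <- matrix(c(255, 526, 1152,
                   168, 628, 1036),
                 nrow = 2, byrow = TRUE,
                 dimnames = list(sex = c("Male", "Female"),
                                 education_cat = c("Low", "Medium", "High")))

# margin = 1 divides each cell by its row total
row_pct <- round(100 * prop.table(counts, margin = 1), 1)
row_pct
```

With the actual data, the same idea would be prop.table(table(corona_survey$sex, corona_survey$education_cat), margin = 1).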

5

Use the correlation package to calculate and print correlations between the following variables: risk_self, risk_surround, sum_measures, sum_sources.
You need to use select() from dplyr and correlation() from the package of the same name.
library(correlation)

corona_survey %>% 
  select(risk_self,
         risk_surround,
         sum_measures,
         sum_sources) %>% 
  correlation()
## Parameter1    |    Parameter2 |    r |       95% CI |     t |   df |      p |  Method | n_Obs
## ---------------------------------------------------------------------------------------------
## risk_self     | risk_surround | 0.76 | [0.75, 0.78] | 65.29 | 3075 | < .001 | Pearson |  3077
## risk_self     |  sum_measures | 0.16 | [0.13, 0.20] |  9.29 | 3146 | < .001 | Pearson |  3148
## risk_self     |   sum_sources | 0.06 | [0.03, 0.10] |  3.62 | 3129 | < .001 | Pearson |  3131
## risk_surround |  sum_measures | 0.14 | [0.11, 0.17] |  7.89 | 3098 | < .001 | Pearson |  3100
## risk_surround |   sum_sources | 0.09 | [0.06, 0.13] |  5.06 | 3081 | < .001 | Pearson |  3083
## sum_measures  |   sum_sources | 0.13 | [0.09, 0.16] |  7.16 | 3166 | < .001 | Pearson |  3168
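
The n_Obs column differs from pair to pair, which suggests that each coefficient is based on the cases that are complete for that particular pair of variables. The following base R sketch illustrates this kind of pairwise deletion on made-up data. The data frame toy and its columns are assumptions for illustration, and the code is not meant to reproduce what correlation() does internally.

```r
# Made-up data with missing values in different rows
toy <- data.frame(a = c(1, 2, 3, 4, NA),
                  b = c(2, 4, 6, 8, 10),
                  c = c(5, NA, 3, 2, 1))

# Pearson correlations, each using the rows complete for that pair
cor_mat <- cor(toy, use = "pairwise.complete.obs")

# Number of complete observations per pair of variables
is_observed <- !is.na(as.matrix(toy))
n_obs <- crossprod(is_observed + 0L)

cor_mat
n_obs
```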

6

As a final task in this exercise on EDA, plot the above correlations with a function from the GGally package. The plot should include the coefficients rounded to two decimal places as labels.
The required function is ggcorr().
library(GGally)
## Registered S3 method overwritten by 'GGally':
##   method from   
##   +.gg   ggplot2
corona_survey %>% 
  select(risk_self,
         risk_surround,
         sum_measures,
         sum_sources) %>% 
  ggcorr(label = TRUE,
         label_round = 2)